Abstract: Having pure samples of quark and gluon jets would greatly facilitate the study 
of jet properties and substructure, with many potential standard model and new physics 
applications. To this end, we consider multijet and jets+X samples, to determine the purity 
that can be achieved by simple kinematic cuts leaving reasonable production cross sections. 
We find, for example, that at the 7 TeV LHC, the pp → γ+2jets sample can provide 98% 
pure quark jets with 200 GeV of transverse momentum and a cross section of 5 pb. To get 
10 pb of 200 GeV jets with 90% gluon purity, the pp → 3jets sample can be used. b+2jets is 
also useful for gluons, but only if the b-tagging is very efficient. 



1. Introduction 



Proton colliders, like the Large Hadron Collider at CERN, produce an enormous number of 
high energy jets. These jets are manifestations of hard quarks or gluons produced at very short 
distances, which shower and fragment into collections of collinear particles. Being able to 
distinguish quark and gluon jets could be extremely useful for new physics searches. For example, 
many models with supersymmetry produce dominantly quark jets while their backgrounds 
are dominantly gluon jets. The hope is then to discriminate signal from background by using 
observables like jet mass, which are strongly correlated with flavor [6]. In 
order to validate these observables on data, it would be useful to have relatively pure samples 
of light quark or gluon jets to study. It is the purpose of this paper to suggest where those 
samples might be found. 

At leading order in perturbation theory, there is no ambiguity in what is meant by the 
quark and gluon jet fraction in any exclusive sample. For example, as we show below, in a 
300 GeV dijet sample at the 7 TeV LHC, the division is roughly 50/50. This comes simply 
from the ratio of the LO cross sections for the various channels, which do not interfere. The 
fraction can be defined beyond leading order as well. In fact, it is well-defined to all orders 
in perturbation theory up to the same power corrections that affect any jet algorithm's parton 
correspondence. These power corrections involve the jet size R (equivalently the jet's 
mass-to-energy ratio m/E) and Λ_QCD/E. One can also construct an infrared-safe definition of flavor 
at the jet level, but that is not the subject of this paper. We further discuss the theoretical 
issues associated with defining quark and gluon jets in Section 4. 

To be clear, we do not propose that the quark and gluon fractions can be measured 
directly in data. Instead, one can measure observable properties of the samples, such as the 
jet mass, and compare them to theoretical predictions, such as from Monte Carlo simulations. 
The purity calculations in this paper suggest regions where the measurements would be most 
enlightening. 

It may not be obvious why one would want pure samples of quark or gluon jets at all. 
Instead, one could just study the jet observables directly in any mixed sample. For example, 
it is well known that the distribution of jet mass for 300 GeV jets is typically wider and peaks 
at larger values for gluon jets than quark jets. In a 50/50 sample, such as the 300 GeV dijet 
sample, one could then hope to find two separated peaks. Unfortunately, the combined 
distribution does not have two distinct peaks for jet mass, or charged particle count, or any 
other known discriminant — the distributions are just too broad. Moreover, correlations in 
the 2D distribution of observables like jet mass and charged particle count might take different 
forms that would be impossible to see in a 50/50 sample. The purer the sample, the closer 
one can come to studying quark and gluon jets on an event-by-event basis. 

In this paper, we simulate a wide variety of processes at tree level for the 7 TeV LHC. 
These include events with gluon and light quark (uds) jets, b-jets, W's, Z's and γ's. We 
begin using only the experimentally minimal cuts. Then we find kinematic cuts, such as on 
rapidity differences, which further purify the samples. Section 2 describes the event samples 
and Section 3 the purification procedure. Section 4 discusses theoretical issues associated 
with defining quark and gluon fractions in perturbation theory. Section 5 summarizes the 
results. 



2. Starting Samples to Explore and Purify 

All events were generated with MadGraph v4.4.26, a tree-level matrix element generator, 
using leading order CTEQ6L1 PDFs [14]. Working only at tree level makes our results 
independent of any jet algorithm and showering/hadronization routine. Of course, we do not 
expect the efficiencies we find to agree with efficiencies one would get after full simulation, or 
in data, but this is a simple and informative way to determine where quark and gluon jets 
can be found. 

For each sample and each pT, 200,000 events were generated with the following cuts: 

• pT > PT, the sample's pT cut, for all 'jets', meaning any gluons or uds quarks. 

• pT > 20 GeV for any photons. 

• pT > 20 GeV for any leptons from W or Z decays (including missing ET from neutrinos). 

• pT > 20 GeV for any b quarks. 

• |η| < 2.5 for any jet, b, photon, or charged lepton. 

• ΔR > 1.0 between any two jets. 

• ΔR > 0.5 between any jet and any photon, or between any jet and any charged lepton. 
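
The cut list above can be sketched as a parton-level event filter. This is only an illustration, not the generator configuration used for the paper: the particle dictionaries, the `kind` labels and the function names are hypothetical, and it assumes that the ΔR > 1.0 requirement applies to light (gluon or uds) jets only.

```python
import math

# Hypothetical labels: light "jets" are gluons or uds quarks at parton level.
LIGHT_JETS = {"g", "u", "d", "s"}
# Objects that must satisfy |eta| < 2.5 (neutrinos are exempt).
CENTRAL = LIGHT_JETS | {"b", "photon", "lepton"}


def delta_r(a, b):
    """Angular distance sqrt(d_eta^2 + d_phi^2), with d_phi wrapped to [-pi, pi)."""
    dphi = (a["phi"] - b["phi"] + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(a["eta"] - b["eta"], dphi)


def passes_cuts(particles, jet_pt_cut):
    """Return True if a list of final-state particles passes the generic cuts.

    Each particle is a dict with keys 'kind', 'pt' (GeV), 'eta' and 'phi'.
    """
    jets = [p for p in particles if p["kind"] in LIGHT_JETS]
    for p in particles:
        # Light jets get the sample's pT cut; everything else gets 20 GeV.
        threshold = jet_pt_cut if p["kind"] in LIGHT_JETS else 20.0
        if p["pt"] <= threshold:
            return False
        if p["kind"] in CENTRAL and abs(p["eta"]) >= 2.5:
            return False
    # Delta R > 1.0 between any two light jets (assumption: b's excluded).
    for i, first in enumerate(jets):
        for second in jets[i + 1:]:
            if delta_r(first, second) <= 1.0:
                return False
    # Delta R > 0.5 between any jet and any photon or charged lepton.
    for jet in jets:
        for p in particles:
            if p["kind"] in {"photon", "lepton"} and delta_r(jet, p) <= 0.5:
                return False
    return True
```

Calling `passes_cuts(event, 200.0)` on a γ+2jet event would then reproduce the selection of the 200 GeV sample under these assumptions.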

Since the quark and gluon fractions, as well as jet properties, can be strongly dependent 
on pT, we have to be careful about how we divide the sample into different pT bins. We will 
often find that it is the softest jet in a sample, such as the softest jet in the 3-jet or γ+2jet 
sample, which leads to the highest purity. Since the cross sections fall rapidly with pT, the 
majority of events for a given pT cut will fall around that minimum pT. This is why all jets 
in a given sample must be above the given pT, with 'jet' here referring only to light quarks or 
gluons. In the 200 GeV bjj sample, for example, each light quark or gluon is required to have 
pT > 200 GeV, but the b is only required to have pT > 20 GeV. In 2-object final states 
like γ+jet, the γ automatically also satisfies the jet pT requirement. 

Samples where only one jet satisfies the hard pT cut, with the others having a pT > 20 GeV 
cut, were also examined. These have larger cross sections, but only the hardest jet tends to 
fall within the pT range of interest, and the kinematic cuts required to achieve high purities 
reduced the cross section below that of the softest-jet samples discussed here. 

The starting cross sections are shown in Figure 1 as a function of the pT cut applied 
to all light quarks and gluons, along with the other cuts listed above. If a sample has a 
bigger starting cross section, it will be able to suffer harder purification cuts while retaining a 
substantial number of events. In this plot, the tt̄ sample includes the semi-leptonic branching 
ratios (2 leptons, 2 b's and 2 light quarks) and has the pT cut applied to only one of the 
light quark jets. Despite this looser cut, the cross section drops precipitously above 200 GeV, 
mostly due to the requirement that the jets be separated by ΔR > 1. Since the semi-leptonic 
tt̄ cross section is very small compared to the other processes, we conclude that tt̄ events are 
not a good way to get a large quark jet sample, despite the fact that jets coming from the 
hadronic W decay are 100% quark. 

Instead of putting a cut on the pT of all the jets, we also tried sorting jets by their 
rapidity. For example, we asked how often the most (or least) central jet is initiated by a 
particular parton. This was never more effective at purification than sorting by pT. Rapidity 
differences will be used to further purify the samples, but for the starting distributions, we 
stick with the pT cut. 

In the following, 'quark jet' will always mean only u, d, and s quarks. Any b's and c's are 
treated as perfectly taggable, although it is straightforward to put in the tagging efficiencies. 
In Figures 2 through 4 we show the fraction of quarks and gluons produced in the various 
samples as a function of pT. When dijet events are referred to as 'QG', that means one jet is 
a gluon and the other is a uds quark. The fraction of dijet events that are 'QG' does 
not include cases with b or c jets in the numerator or the denominator. 
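
As a minimal sketch of this bookkeeping, the function below labels two-jet events as 'QQ', 'QG' or 'GG' and drops any event containing a b or c jet from both numerator and denominator. The flavor strings and function names are hypothetical, and the sketch assumes unweighted events; weighted events would sum weights instead of counts.

```python
from collections import Counter

LIGHT_QUARKS = {"u", "d", "s"}
HEAVY = {"b", "c"}


def dijet_label(jet_flavors):
    """Label a pair of jet flavors, or return None if the event is excluded."""
    if len(jet_flavors) != 2 or any(f in HEAVY for f in jet_flavors):
        return None  # b or c jets enter neither numerator nor denominator
    n_quark = sum(f in LIGHT_QUARKS for f in jet_flavors)
    n_gluon = sum(f == "g" for f in jet_flavors)
    if n_quark + n_gluon != 2:
        return None  # unknown flavor label
    return {2: "QQ", 1: "QG", 0: "GG"}[n_quark]


def dijet_fractions(events):
    """Fractions of QQ, QG and GG among the events that are not excluded."""
    counts = Counter()
    for flavors in events:
        label = dijet_label(flavors)
        if label is not None:
            counts[label] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in ("QQ", "QG", "GG")}
```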

In Figure 5, we show the probability that a given jet is a quark or gluon as a function 
of pT for the different samples, assuming one jet is picked at random. We see that γ+1jet or 
W/Z+2jets are good for quark jets, and b+2jets or the 3- or 4-jet samples are good for gluon 
jets. Again, this is just for the generic cuts listed above, and we have not yet attempted to 
purify the samples using rapidity or other kinematic information. 

In order to purify the samples, we can go two ways. One approach is to reject events 
so that the remaining events contain either all quark jets or all gluon jets. In 
the top panels of Figure 6, we show the fraction of events where all jets are quark or gluon. 
Note that the vertical axis in these plots is logarithmic. The other approach is to look at 
particular jets in an event, eventually hoping to apply kinematic cuts to purify the quark or 
gluon content of that jet. (Such cuts are the topic of the next section.) In the bottom of 
Figure 6 we show the fraction where the hardest or softest jet is quark. These starting points 
indicate that quark jets will be easier to purify than gluon jets. 

Figure 1: Leading order cross sections, including kinematic cuts and branching ratios for Z/W decay 
to include an electron or muon. The x-axis indicates the pT cut applied to all light quarks and 
gluons, but not b-quarks. The constraint on the pT for b's, photons, and charged leptons or neutrinos 
from Z/W (though not the Z/W itself) is fixed at 20 GeV. Note that the 3-jet cross section falls below 
b+2jets due to the harder cuts on the non-b jets. The tt̄ cross section refers to the semi-leptonic sample, 
and, in contrast to all the other samples, the pT cut is applied to only one of the two light-quark jets. 
Since its cross section is so low, it will not be considered further. 

Figure 2: Fraction of X+1jet events where the jet is uds quark (bottom and blue in each plot) as 
compared to gluon (top and red). The horizontal axis is a pT cut on the jet, which in these events 
translates into an identical pT cut on the other object. 

Figure 3: Fraction of X+2jet events where the jets are both light quark 'QQ' (bottom blue) vs one 
light quark one gluon 'QG' (middle purple) vs both gluon 'GG' (top red). Notice γ + GG almost never 
happens, nor does b + QQ. These are starting points for quark and gluon purification. The horizontal 
axis is a pT cut on all jets, while the other objects (b, γ, and leptons from Z/W) have pT > 20 GeV. 

Figure 4: Division of the multijet (dominantly QCD) sample. The horizontal axis is a pT cut on all 
jets. Notice that all three jets are almost never all quark, and in the 4-jet sample, there are almost 
always at least two gluons. The 3-jet sample will be a starting point for gluon purification. 

Figure 5: The chance that a given jet is a light quark jet rather than a gluon jet. (This ratio does 
not include bottom or charm.) The W and Z were nearly identical and combined on this plot, but 
they are slightly different from the photon, mostly due to the γ and lepton cuts. 



[Figure 6 panels: "Fraction where ALL Jets are Quark" and "Fraction where ALL Jets are Gluon" (top row), "Fraction where HARDEST Jet is Quark" and "Fraction where SOFTEST Jet is Quark" (bottom row). Horizontal axis in each panel: pT Cut on All Jets (GeV).] 

Figure 6: The top row shows the fraction of events where all jets are quark or gluon, on a log scale. 
The bottom row shows the fraction where the highest pT jet is quark, and where the lowest pT jet is 
quark, on a linear scale. (One minus this fraction are gluon jets.) Having more jets allows for more 
kinematic handles and potentially better purity. 



3. Purifying the samples 

In this section, we consider how to improve the purity by judicious kinematic cuts. It's 
actually quite challenging to get high purities, as we will see. For example, if you start with 
a 50% pure quark sample and find a set of cuts that keeps quarks twice as efficiently as 
gluons, your new purity is not 75%, but only 67%. To reach 75%, you need cuts that keep 
quarks three times as efficiently as gluons. 

Any cut will have some efficiency ε_q to keep quark jets and a different efficiency ε_g to 
keep gluon jets. Let q be the starting fraction of events where the jet in question (e.g. the 
lower pT 'softer' jet) is a light quark, and g = 1 - q the fraction of events where it is a gluon. 
Then, after a cut, the quark purity is 

\[
\text{quark purity} = \frac{q\,\varepsilon_q}{q\,\varepsilon_q + g\,\varepsilon_g} .
\]

Say we want to optimize the quark purity. One particular cut on the set of kinematic variables 
will be the best cut for a particular quark efficiency ε_q. This will be the cut that lowers the 
gluon acceptance ε_g as much as possible. 
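
The arithmetic of this section can be checked with a few lines of code. The sketch below assumes the purity after a cut is q·ε_q / (q·ε_q + g·ε_g), which follows directly from the definitions of q, g and the two efficiencies; the function names are illustrative only.

```python
def purity_after_cut(q, eps_q, eps_g):
    """Quark purity after a cut with quark efficiency eps_q and gluon efficiency eps_g.

    q is the starting quark fraction; the gluon fraction is g = 1 - q.
    """
    g = 1.0 - q
    kept_quarks = q * eps_q
    kept_gluons = g * eps_g
    return kept_quarks / (kept_quarks + kept_gluons)


def required_efficiency_ratio(q, target_purity):
    """Ratio eps_q / eps_g needed to move from purity q to target_purity."""
    g = 1.0 - q
    return target_purity * g / ((1.0 - target_purity) * q)


# Starting from a 50/50 sample:
two_to_one = purity_after_cut(0.5, 1.0, 0.5)      # quarks kept twice as often
three_to_one = purity_after_cut(0.5, 0.9, 0.3)    # quarks kept three times as often
```

The second helper makes the difficulty explicit: going from an 80% pure sample to 98% requires the cut to keep quarks about twelve times as efficiently as gluons.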

To reach a given quark purity, it obviously helps to start with a sample that's mostly 
quark. But it is possible to find effective kinematic cuts that improve a mediocre quark purity. 
This is the case in the γ+2jet sample. Strong cuts can increase the quark purity quite a bit 
for some samples, but at the cost of a much lower cross section. In the following, we will be 
careful to express our results as the cross section for quark and gluon jets with a given purity. 

3.1 Quark jet purification 

We begin by discussing purifying quark jets. As can be seen in Figure 2, the γ+1jet sample 
appears to be a good starting point, with roughly 80% quarks. This fraction is just the fraction 
of direct photons produced in the annihilation channel qq̄ → gγ (20%) versus the Compton 
channel qg → qγ (80%), which is in turn set by the gluon and q̄ PDFs. Since the gluon PDF 
is larger than the q̄ PDF in a proton, the Compton channel dominates. Unfortunately, the 
1-jet samples, such as γ+1jet or W/Z+1jet, do not leave many options for kinematic cuts. 
Rapidity cuts do not do much, since at high pT the jets are more-or-less central, and the 
cross sections are basically fixed by the PDFs. In fact, the quark purity saturates at roughly 
88%. Thus, it is helpful to have additional jets to get an additional handle on the kinematics, 
which will lead us to purities approaching 100%. 

We turn next to the next best sample, γ+2jets. Note that W/Z+2jets is kinematically 
very similar, but since it has a smaller starting cross section, we focus on the photon. 
The rapidity distributions for the photon and the softer and harder jets in the samples are 
shown (for pT > 200 GeV) in Figure 7. These 1D distributions look like they contain some 
information, but there is in fact more information in their correlations. Figure 8 shows the 2D 
distribution of the rapidity of the harder jet and the rapidity of the photon. The likelihood 
map constructed from these distributions is shown in the third panel. Contours of constant 
likelihood are very well approximated as contours where the product of the rapidities η_γ η_j1 
is constant, as shown in Figure 9. The quark/gluon discriminant for this product variable is 
also shown in Figure 9. It clearly has more discrimination power than any of the individual 
rapidities. 

Figure 7: To purify quarks, the best starting point is the softer jet in the γ+2jet sample. The η of 
the photon (left) along with the harder (center) and softer (right) jets look different when the softer 
jet is a quark (blue solid) vs a gluon (red hashed). These distributions are normalized to equal area. 
(200 GeV sample shown) 

Figure 8: For the quark-heavy γ+2jet sample, a 2D version of the last figure's first two histograms: η_γ 
of the photon vs η_j1 of the harder jet. The left histogram is for when the softer jet is a quark, and 
the center histogram is for when the softer jet is a gluon. Though we are trying to purify the softer 
jet, it's best to cut on η_γ and η_j1 of the harder jet. From the left histogram it's clear that when the 
softer jet is a quark, the harder jet is quite central and the photon's |η| is higher and uncorrelated. 
When the softer jet is a gluon, the harder jet is often toward the edge of our η cut, with the photon 
nearby in η. Correlations are lost if one takes the absolute value of these η's. The likelihood ratio on 
the right combines each bin as q/(q + g), with blue being more quark-like. When the photon and 
harder jet are widely separated in η, the softer jet is likely quark. (200 GeV sample shown) 
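
A binned likelihood map of this kind can be sketched as follows. The example fills two 2D histograms, one for events where the softer jet is a quark and one where it is a gluon, and combines each bin as q/(q + g). It is an illustration under stated assumptions, not the procedure used for the figures: the function names are hypothetical, and q and g are taken here to be raw bin counts, whereas the paper's normalization is not specified in this section.

```python
from bisect import bisect_right


def hist2d(points, edges_x, edges_y):
    """Count (x, y) points in a grid defined by sorted bin edges."""
    nx, ny = len(edges_x) - 1, len(edges_y) - 1
    counts = [[0.0] * ny for _ in range(nx)]
    for x, y in points:
        ix = bisect_right(edges_x, x) - 1
        iy = bisect_right(edges_y, y) - 1
        if 0 <= ix < nx and 0 <= iy < ny:  # ignore points outside the grid
            counts[ix][iy] += 1.0
    return counts


def likelihood_map(quark_points, gluon_points, edges_x, edges_y):
    """Per-bin quark likelihood q / (q + g); None marks empty bins."""
    hq = hist2d(quark_points, edges_x, edges_y)
    hg = hist2d(gluon_points, edges_x, edges_y)
    result = []
    for row_q, row_g in zip(hq, hg):
        result.append([
            q / (q + g) if q + g > 0 else None
            for q, g in zip(row_q, row_g)
        ])
    return result
```

Here each point would be a pair (η of the photon, η of the harder jet). Cutting on contours of the resulting map is the binned analogue of the likelihood cut described above.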

Another option for the γ+2jet sample is to consider the ΔR's between the photon and 
the jets. Due to a collinear singularity in q → qγ, it is natural to expect the photon to be close 
to one of the quarks. This is in contrast to the gluon case, since there is no g → gγ vertex. 
The distribution of ΔR between the photon and each jet is shown in Figure 10. Performing 
a similar 2D likelihood analysis as with just the rapidity inputs, we find that the single 
variable η_γ η_j1 + ΔR_γj2 does very well. Its distribution is also shown. 

Figure 9: Cutting on any contour in the 2D likelihood distribution above is statistically the optimal 
discriminant for each quark efficiency, given only these two variables. The contours are roughly given 
by η_γ η_j1, which is plotted here on the left. We will see that this single variable captures most of the 
discrimination power of the full 9D multivariate likelihood estimate. On the right is the distribution 
of this product. (200 GeV sample shown) 

Figure 10: For the quark-heavy γ+2jet sample, distance between the photon and the harder jet 
(left) and softer jet (center). Notice that the photon is often as collinear with the softer jet as our 
ΔR_γj2 > 0.5 cut allows. Doing the same 2D likelihood examination as before, an even better single 
variable discriminant is η_γ η_j1 + ΔR_γj2, a combination of the product of η of the photon and harder 
jet, plus the distance to the softer jet. The distribution of this mixed variable is shown on the right. 
(200 GeV sample shown) 
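
For concreteness, the single-variable discriminant for the γ+2jet sample can be computed as in the sketch below. The dictionary layout and function names are hypothetical. Based on the figure descriptions (photon and harder jet on opposite sides in η, photon close to the softer jet), smaller values appear to be more quark-like, but the direction and value of any cut would have to be read off the actual distributions.

```python
import math


def delta_r(a, b):
    """Angular distance sqrt(d_eta^2 + d_phi^2), with d_phi wrapped to [-pi, pi)."""
    dphi = (a["phi"] - b["phi"] + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(a["eta"] - b["eta"], dphi)


def quark_discriminant(photon, jets):
    """Compute eta_photon * eta_j1 + DeltaR(photon, j2).

    j1 is the harder and j2 the softer of the two jets, ordered by pT.
    Each object is a dict with keys 'pt', 'eta' and 'phi'.
    """
    harder, softer = sorted(jets, key=lambda jet: jet["pt"], reverse=True)
    return photon["eta"] * harder["eta"] + delta_r(photon, softer)
```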



In constructing unusual variables like η_γ η_j1 + ΔR_γj2, it is natural to wonder if we are 
being sufficiently comprehensive. Considering that for a sample with n final-state on-shell 
quarks and gluons, there are only 3n degrees of freedom, it is possible simply to put these 
6, 9 or 12 variables into a multivariate analysis. (Transverse momentum conservation and 
rotational symmetry can reduce the number of degrees of freedom by 3, but it does not hurt 
to include some redundant information.) More precisely, we input the (pT, η, φ) of each object 
into a Boosted Decision Tree, which is easy to do with the TMVA package for ROOT [16]. 
The results can be taken as a best case, to which our single variable cuts can be compared. 
(To be honest, we arrived at this single variable partly by observing which variables TMVA 
found most important.) 



The results of the multivariate analysis for quark jet purification are shown in Figure 11. 
On the left side are the results for 200 GeV jets, cutting on the BDT output. Note that, as 
anticipated, the γ+1jet sample cannot be purified much: harsher cuts hit a wall and 
eventually just kill the cross section. On the right, we focus on just the γ+1jet and γ+2jet 
samples for all pT. The red curves are the BDT output using 6 inputs for γ+1jet, the 
blue curves the BDT with 9 inputs for γ+2jets, and the black curves our single variable 
η_γ η_j1 + ΔR_γj2. It is nice that the single variable does as well as the comprehensive analysis 
using the 9 BDT inputs. 

Figure 11: Cross section as a function of quark purity. The left panel shows the purity for the 
different samples with a 200 GeV cut on all non-b jets. The different points correspond to different 
cuts placed on a Boosted Decision Tree output, trained to optimize the quark purity. The leftmost 
dots of each sample are the uncut purities, and each successive dot corresponds to cutting the number 
of events in half. By the final dot, which keeps 1/128th of the signal, cutting harder no longer increases 
the purity. The right panel shows the purities for the γ+1jet (red) and γ+2jet (blue) samples for 
various pT's, where the cuts are with BDTs trained on 6 and 9 kinematic variables, respectively. The 
black curves correspond to purities obtained after cutting on the single variable η_γ η_j1 + ΔR_γj2. The 
blue curve takes the jet closest to the photon as a starting point, whereas the black curve takes the 
softer of the two jets as its starting point. This is the reason for the lower initial purity but the same 
cross section. (It was easier to find a single variable using the softer jet rather than the jet closer to 
the photon.) 
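
The purity-versus-cross-section curves can be traced with a simple scan over any discriminant, whether a BDT output or a single kinematic variable. The sketch below follows the convention described for Figure 11, where each successive point keeps half as many events as the previous one. The function name, the event format and the choice of which end of the score is kept are assumptions made for illustration.

```python
def purity_scan(events, sigma_start, n_halvings, keep_lowest=True):
    """Return (cross section, quark purity) points for successively harder cuts.

    events: list of (score, is_quark) pairs for unweighted events.
    sigma_start: cross section of the uncut sample (any unit).
    keep_lowest: keep the lowest scores if True, the highest otherwise.
    """
    ordered = sorted(events, key=lambda event: event[0], reverse=not keep_lowest)
    total = len(ordered)
    points = []
    n_keep = total
    for _step in range(n_halvings + 1):
        if n_keep == 0:
            break
        kept = ordered[:n_keep]
        n_quark = sum(1 for _score, is_quark in kept if is_quark)
        points.append((sigma_start * n_keep / total, n_quark / n_keep))
        n_keep //= 2  # each successive point keeps half as many events
    return points
```

The first point is the uncut sample, so curves built from different discriminants on the same sample agree there and separate only as the cuts tighten.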

We conclude that the best way to get a clean quark sample at low pT is to use γ+1jet, for 
simplicity, or γ+2jets at moderate to large pT, cutting on the single variable η_γ η_j1 + ΔR_γj2. 
Depending on how much cross section you are willing to sacrifice, for 200 GeV jets, you can 
get 95% quark purity at 2 pb or 99% purity at 500 fb. 

3.2 Gluon jet purification 

Next, we turn to the more difficult case of gluon jet purification. It is more difficult because 
there is no starting sample with purity above 80%, and because there are no simple physically 
motivated handles for purification. Indeed, for the quark, we used the fact that there is a 
collinear q → qγ singularity but no g → gγ singularity to inspire a ΔR_γj cut. But for a gluon we 
cannot use the gq singularity since we are trying to avoid q jets altogether. The exception is 
samples with jets and b's, where we can use b-tagging information to help purify the sample. 
This will in fact be relevant, but we will find that the 3- and 4-jet samples actually work quite 
well, and avoid having to deal with b-tagging. 

Figure 12: Cross section as a function of gluon purity for the different samples with a 200 GeV cut 
on all non-b jets (panel title: "200 GeV Gluon Purity"). The different points correspond to different cuts placed on a Boosted Decision Tree 
trained to optimize the gluon purity. The leftmost dots of each sample are the uncut purities. There 
are three curves for the 3-jet samples, and two for the b+2jet samples, corresponding to which of the jets 
(from hardest to softest) is being considered. Note the three 3-jet samples start with identical cross 
sections, but higher purities are achievable for the softer jets. 

To begin, we start with a multivariate BDT analysis using as inputs the (pT, η, φ) of all 
final state particles. The results for the different 200 GeV samples are shown in Figure 12. 
We can see that while the b+2jets has good efficiency, it also has a cross section orders of 
magnitude smaller than the 2-jet sample. The 3-jet sample is somewhere in between, with 
efficiencies about 80% for a cross section of 100 pb. We will consider these three samples in 
the following, as there may be situations when each one is advantageous. 

First, consider the b+2jet sample. Looking back at Figure 3, we see that there is a 
contribution from both 'GG' (with ggb final states) and 'QG' (with qgb final states). The ggb 
section obviously has perfect gluon efficiency regardless of cuts. The main parton-level process 
contributing to the qgb channel is ub → ubg, which looks like final state gluon radiation from 
t-channel ub → ub. Since we put a harder cut on the u and g than the b, the kinematics 
will mostly have the u going back-to-back with the gb, and so the g will be somewhat softer. 
This explains why the starting efficiencies for the softer jet at pT = 200 GeV are around 73%, 
versus 63% for the harder jet, as shown in Figure 12. 

Figure 13: Gluon purities for the dijet and trijet samples for different pT's. For each pT sample, the 
first dot on the top-left represents the starting purity and cross section with no kinematic cuts. 

Figure 14: To purify gluons in the 3-jet sample, we look at the softest jet, which tends to be central. 
Its η is shown on the left. An even better discriminant takes into account the separation of the harder 
two jets, and the correlation between this separation and the softest jet's η is shown in the center. A 
good single variable capturing the likelihood contours is |η_j3| - |η_j2 - η_j1|, whose distribution is shown 
on the right. (200 GeV sample shown) 



The main complication in the b+jets samples is efficient b-tagging. So far, we have 
assumed perfect b-tagging, so that both jets are effectively anti-b-tagged. In reality, b-tagging 
can be made very tight, keeping only jets that really look like b-jets or really look like 
non-b-jets. A very tight b-tag will lower the cross section without affecting the purities shown. If 
looser b-tagging is used, the cross section will be higher but mistags of jjj and mis-anti-tags 
of bbj make the analysis more complicated. Note, however, that the dominant background 
to b-jets is charm jets, and from the point of view of finding gluon jets, it is fine to treat 
charm jets as b-jets. In many ways b-jets act like gluon jets rather than like light quark 
jets. For example, the OPAL experiment at LEP [17] found b-jets to have more charged 
particles over a wider area than light quark jets, making them similar to gluon jets in this 
regard. It is therefore very important to have tight anti-b-tagging on any jet used in further 
analysis, no matter which starting sample it came from. Since b-tagging is very detector and 
pT dependent, we do not attempt to include it in any quantitative way in this tree-level study. 

Figure 15: Cross section as a function of gluon purity. The gray curve shows how pure the sample 
could be made using all the kinematic information, through a Boosted Decision Tree. The green curves 
show the result of cutting on the rapidity of the softest jet η_j3. The red curve shows that by cutting 
on the single variable |η_j3| - |η_j2 - η_j1|, nearly optimal purities can be achieved, matching the BDT. 
Note, all three curves agree at their left-most points, where no cut is applied. 

Figure 16: Cross section as a function of gluon purity. The gluon tagging efficiencies for the dijets 
(black), trijets (gray), and b+2jets (orange) are shown. All curves correspond to the result of an 
optimal purification using a multivariate analysis. Nearly optimal results can be reproduced in the 
trijet sample with a simple cut on a single kinematical variable, as described above. 

Next, consider the dijet and trijet samples. There is actually a fairly strong pT dependence 
in the gluon fractions, as can be seen in Figure 4. As before, we begin by using full kinematic 
information in Boosted Decision Trees. The result is shown in Figure 13. We see that dijets 
have a higher cross section, but cannot be purified beyond a limiting value. The trijet sample 
can be purified more, but has a lower cross section since its softest jet must be above the 
indicated pT. While the efficiencies are not as high as in b+2jets, the trijet sample can provide 
90% gluon purity with large cross sections and few b-tagging worries. A similar analysis can 
simplify the kinematic cuts to a few variables. 

The best single simple variable to cut on for the softest jet in the trijet sample is the 
rapidity of that jet, η_j3. Its distribution is shown in the left panel of Figure 14, where we 
can see that the softest jet tends to be central when it is a gluon and more forward when it 
is a quark. Unfortunately, just cutting on the rapidity of the softest jet can only do so well 
in purifying the sample. This can be seen from the distributions: there is no region which is 
pure gluon. To be more quantitative, the effect of cutting on η_j3 is shown in Figure 15. The 
green curves, representing cuts on η_j3, hit a hard wall for each pT. 

To progress further, we observe that η_j3 is only weakly correlated with the rapidity 
difference of the other two jets, |η_j2 - η_j1|. The 2D distribution and the likelihood contours 
are shown in the center of Figure 14. These contours are well mapped by |η_j3| - |η_j2 - η_j1|, 
which we take as our best composite variable. Its distribution is shown on the right of this 
figure. Note that, in contrast to η_j3, the distribution of this composite variable has a gluon 
tail toward negative values. Thus, it should be possible to put very hard cuts on it to improve 
efficiency. The result is shown and contrasted to the full BDT and η_j3 results in Figure 15. We 
see that cutting on this variable does nearly as well as using the full kinematic information. 
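
As an illustration, the composite trijet variable can be computed from the three jets as below. The function name and dictionary layout are hypothetical; per the text, the gluon tail lies toward negative values, so a gluon-enriched selection would keep events below some threshold chosen from the distributions.

```python
def gluon_discriminant(jets):
    """Compute |eta_j3| - |eta_j2 - eta_j1| for a three-jet event.

    j1, j2, j3 are the jets ordered from hardest to softest in pT.
    Each jet is a dict with keys 'pt' and 'eta'.
    """
    if len(jets) != 3:
        raise ValueError("expected exactly three jets")
    j1, j2, j3 = sorted(jets, key=lambda jet: jet["pt"], reverse=True)
    return abs(j3["eta"]) - abs(j2["eta"] - j1["eta"])
```

A central softest jet with the two harder jets widely separated in rapidity gives a strongly negative value, which is the gluon-like corner of the 2D distribution described above.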

The results for the dijet, trijet and b+2jet samples are summarized in Figure 16. To get 
very high ~99% gluon efficiencies, one needs the b+2jet samples with excellent b-tagging. But 
at 80% or 90%, one can instead use trijets cutting on the discriminant |η_j3| - |η_j2 - η_j1|. The 
trijet sample has a much larger cross section than b+2jets for the lower jet pT samples. 



4. Defining quark and gluon jets in QCD 

In this section, we discuss what exactly is meant by quark and gluon jets. We begin by 
considering particle decays, since they provide a context in which the concept of quark and 
gluon jets is more intuitive. We then discuss how soft and collinear radiation preserves the 
identity of a jet as quark or gluon, and how quark and gluon cross sections can be defined 
beyond leading order. 

Consider a Z boson which decays to 2 jets. In the limit that the jets are highly collimated 
and well separated, these jets are 100% quark jets. This is not to say that there are no gluons 
represented in the jets: beyond leading order in perturbation theory there will be many 
gluons, and these gluons can have as much energy as the quarks, or more. But the jets 
coming from the Z-decay are still quark jets, by definition. (There is actually zero probability 
for the jets to be gluon jets in this case due to Yang's theorem.) One could also imagine a 
particle which would decay only to gluon jets, for example, a light Higgs boson that only 
couples directly to the top (the decay would be through a top-loop). Here, the jets would 
unambiguously be 100% gluon jets. If a particle decays to 3 jets, one can ask about the quark 
and gluon content of the third jet as well. This would also be well-defined to the extent that 
the jets are collimated and separated, which is the same extent that the jets are representative 
of the hard interaction at all. In a multiparticle cascade decay with many jets, such as in 
supersymmetry, one can also ask unambiguously about the quark or gluon jet content of the 
various jets produced. In fact, even in QCD processes, such as $pp \to$ dijets, the concept of 
quark and gluon jets is no more ambiguous than in decays; one is just less used to thinking 
about quark and gluon fractions. 

When jets are highly collimated and well separated, their cross sections factorize into the 
production process, for which there is no mixing between quarks and gluons, and the 
fragmentation process, whereby those quark and gluon jets shower and hadronize into observable 
particles. Although exact factorization proofs are not available for anything but the simplest 
process (Drell-Yan), scaling arguments suggest that any violations of factorization should be 
negligible. Thus, the concept of quark and gluon jets is a well-defined theoretical concept up 
to power corrections that scale as $\Lambda_{\rm QCD}/E$ and $R \sim m/E$, where R is the size of the jet, E 
its energy and m its mass. 
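
To give a feeling for the size of these power corrections, the sketch below evaluates the two small parameters Lambda_QCD/E and m/E for a single jet. The default value Lambda_QCD = 0.3 GeV is only an order-of-magnitude assumption, and the jet energy and mass in the example are hypothetical inputs, not numbers from this paper.

```python
def power_corrections(energy_gev, jet_mass_gev, lambda_qcd_gev=0.3):
    """Return (Lambda_QCD / E, m / E), the parameters controlling the corrections.

    lambda_qcd_gev defaults to 0.3 GeV, an assumed order-of-magnitude value.
    """
    if energy_gev <= 0:
        raise ValueError("jet energy must be positive")
    return lambda_qcd_gev / energy_gev, jet_mass_gev / energy_gev


# Hypothetical jet: E = 200 GeV, m = 20 GeV.
hadronization, collimation = power_corrections(200.0, 20.0)
print(f"Lambda_QCD/E = {hadronization:.4f}, m/E = {collimation:.2f}")
```

For such a jet the first parameter is at the per-mille level, while the second, which tracks the jet size, is the larger of the two.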

As mentioned in the introduction, there is no ambiguity at leading order in defining the 
fraction of quark and gluon jets in any exclusive sample. To be precise, leading order here 
means the Born level, the lowest order in perturbation theory which produces the required 
number of jets. As a concrete example, consider the direct production of a hard photon, 
say with pT > 200 GeV. At leading order, there are two partonic channels, the Compton 
channel $qg \to q\gamma$ and the annihilation channel $q\bar{q} \to g\gamma$. The ratio of the cross sections for 
these channels, at leading order, tells us that 85% of the jets produced in association with a 
photon will be quark jets. For more complicated processes there is also no ambiguity as long 
as we are specific about which jet we mean, in an infrared safe way. For example, we can ask 
about the 2nd hardest anti-kT R = 0.4 jet in W+jets events. Here, the Born level is W+2 
jets, and the cross section ratio can be computed unambiguously (up to scale uncertainties) 
at leading order. 
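
The leading-order bookkeeping described above amounts to summing the cross sections of the non-interfering channels according to the flavor of the jet in question. A minimal sketch follows; the function name is hypothetical, and the two cross sections are placeholders in arbitrary units, chosen only to reproduce the 85% quoted in the text rather than computed from matrix elements and parton distributions.

```python
def quark_fraction(channels):
    """Return the fraction of quark jets from a list of (cross_section, flavor).

    flavor is "q" if the jet of interest is a quark jet in that channel,
    and "g" if it is a gluon jet. Channels are assumed not to interfere.
    """
    total = sum(sigma for sigma, _ in channels)
    if total <= 0:
        raise ValueError("total cross section must be positive")
    quark = sum(sigma for sigma, flavor in channels if flavor == "q")
    return quark / total


# Placeholder cross sections (arbitrary units), not calculated values.
photon_plus_jet = [
    (85.0, "q"),  # Compton channel, q g -> q gamma: recoil jet is a quark
    (15.0, "g"),  # annihilation channel, q qbar -> g gamma: recoil jet is a gluon
]
print(quark_fraction(photon_plus_jet))  # 0.85
```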

At next-to-leading order, there are virtual and real contributions. Both of these are 
infrared divergent and some part of the real contributions must be added to the virtual to get 
a finite answer. The virtual graphs have the same number of jets as the Born level, and so 
whether they contribute to the quark or gluon jet cross section is similarly unambiguous. The 
real graphs can be split into a contribution containing the infrared divergent regions and a 
hard remainder. The infrared divergences are soft or collinear, and in either limit the identity 
of the jet as quark or gluon is conserved. In the soft limit, the interactions of gluons are 
eikonal and factorize off, again leaving the quark or gluon nature of the jet unchanged. In 
the collinear limit, helicity is conserved. So one can treat the helicity of a jet as a conserved 
quantum number which is necessarily different for quark and gluon jets. Moreover, for any 
infrared-safe jet definition, a collinear gluon emitted in the singular region must go into the 
jet, so the overall baryon number of the jet (number of quarks minus number of antiquarks) 
is conserved. Hard emissions must produce another jet, at least in the approximation where 
the jets are highly collimated, which is where factorization holds.^ So the infrared-singular 
parts of the real emission contributions do not change whether the jet is quark or gluon and 
therefore the quark or gluon fraction can be defined at higher orders in perturbation theory. 

To all orders in perturbation theory, the factorization into quark and gluon production 
can be simplified by the use of operators in Soft-Collinear Effective Theory [21], [22]. For 
example, for direct photon production [23], there are 6 production channels, with initial 
states $q\bar{q}$, $\bar{q}q$, $qg$, $gq$, $\bar{q}g$ and $g\bar{q}$. Each channel has two spin structures, corresponding to the 
cases when the quarks have equal or opposite spin. For example, in the $q\bar{q} \to g\gamma$ channel, the 
operators are 

0^g^ = X2A'ixu O^y = X2a^uA^^xi, (4.1) 



So there are 12 operators in total relevant for matching at the Born level. The fields $\chi$ and $A$ 
are collinear quarks and gluons with associated collinear Wilson lines. For simplicity, these 
are called jet fields. More details of the notation can be found in [23]. 

The point of the SCET notation is that it gives a precise definition to what we have been 
calling quark and gluon jets. It therefore lets us define the quark and gluon jet fractions 
exactly, as ratios of matrix elements of operators with quark or gluon jet fields. In the limit 
where factorization holds, there is no mixing between operators with different jet fields, or 
even of fields with different spin. For example, in direct photon production, when the photon is very 
energetic there is only phase space for it to recoil against a single jet. In this limit, the process 
is exactly described by the operators in Eq. (4.1) and the other 10 operators for the other 
channels. The mixing between the operators is power suppressed. To add some concreteness 
to the discussion, at leading order, the jet recoiling against a 300 GeV photon is 82.3% quark. 
At NLO, it is 84.6% quark and at NNLO 85.1% quark. The leading order prediction is a very 
good approximation to more precise values, since the radiative corrections largely drop out 
of the fraction. 
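
The stability of the fraction can be read off directly from the three numbers quoted above. The short sketch below simply tabulates the order-by-order shifts in percentage points; the function and variable names are hypothetical, and the only inputs are the LO, NLO and NNLO quark fractions stated in the text.

```python
# Quark fraction (in percent) of the jet recoiling against a 300 GeV photon,
# as quoted in the text.
QUARK_FRACTION_PERCENT = {"LO": 82.3, "NLO": 84.6, "NNLO": 85.1}


def order_by_order_shifts(fractions, orders=("LO", "NLO", "NNLO")):
    """Return [(lower, higher, shift)] with shift in percentage points."""
    return [
        (lower, higher, fractions[higher] - fractions[lower])
        for lower, higher in zip(orders, orders[1:])
    ]


for lower, higher, shift in order_by_order_shifts(QUARK_FRACTION_PERCENT):
    print(f"{lower} -> {higher}: {shift:+.1f} percentage points")
```

The shift shrinks from 2.3 points at NLO to 0.5 points at NNLO, which is the sense in which the leading-order value is already a good approximation.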

In summary, in this section we have explained how the quark and gluon jet fraction is 
exactly defined in a limit in which the production of jets factorizes into an incoherent sum 
of different channels. This gives precisely calculable cross sections, and hence a well-defined 
quark-to-gluon jet fraction. 



^There may be additional "non-global" contributions, from configurations where a hard gluon splits into 
two quarks and one of those ends up a jet. Whether non-global logs are relevant or not is a question about the 
observable, such as the jet mass, not about whether the jets are quark or gluon. Quark or gluon jets are defined 
to the extent that factorization holds, and non-global logs would violate factorization. More information on 
non-global logs can be found in |l^, po[ . 
5. Conclusions 



In this paper, we have systematically explored which processes at a proton collider can be 
exploited to give pure samples of quark and gluon jets. We found that a 98% pure quark jet 
sample is achievable by starting with the softer jet in γ+2jets and cutting on the combined 
kinematic variable $\eta_\gamma \eta_{j1} + \Delta R_{\gamma j2}$. The corresponding cross sections are around 10 pb for 
pT > 100 GeV, 1 pb for pT > 200 GeV, or 0.1 pb for pT > 400 GeV quark jets. More quark 
purity information is in Figure |l^. 

Gluon jets are more difficult to purify. We found that the b+2jets sample provides the 
best results under ideal conditions. Unfortunately, getting such pure gluon jet samples requires 
an excellent b-tagger, and a realistic analysis can only be done with details of the particular 
experiment and b-tagging method. The next best thing is to use the softest jet in 3jet events. 
This has a higher cross section than the b+2jets sample, but cannot achieve quite as high 
purities. Cutting on the combined variable $|\eta_{j3}| - |\eta_{j2} - \eta_{j1}|$, the trijet sample can provide 
100 pb at 93% purity for 100 GeV gluon jets, 1 pb for 90% purity 200 GeV jets, or 10 fb of 
85% purity 400 GeV jets. More gluon purity information is in Figure 16. 
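
To translate these cross sections into sample sizes, one can multiply by an integrated luminosity. The sketch below does this for the three trijet working points quoted above; the integrated luminosity of 1 fb^-1 (1000 pb^-1) is an assumption made purely for illustration, the function name is hypothetical, and detector acceptance and efficiency are ignored.

```python
def expected_yields(cross_section_pb, purity, luminosity_inv_pb):
    """Return (selected jets, gluon jets among them) for one working point."""
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must lie between 0 and 1")
    total = cross_section_pb * luminosity_inv_pb
    return total, total * purity


# (jet pT in GeV, cross section in pb, gluon purity), as quoted in the text.
# Note that 10 fb = 0.01 pb.
TRIJET_WORKING_POINTS = [(100, 100.0, 0.93), (200, 1.0, 0.90), (400, 0.01, 0.85)]

# Assumed integrated luminosity: 1 fb^-1 = 1000 pb^-1 (not a number from the paper).
LUMINOSITY_INV_PB = 1000.0

for pt, sigma_pb, purity in TRIJET_WORKING_POINTS:
    total, gluons = expected_yields(sigma_pb, purity, LUMINOSITY_INV_PB)
    print(f"pT {pt} GeV: about {total:.1f} jets, {gluons:.1f} of them gluon jets")
```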

The fraction of quark and gluon jets, which we have calculated in this paper at leading 
order in perturbation theory, is a well-defined theoretical concept, up to power corrections in the 
jet size. These power corrections are suppressed when the jets are hard and well-separated. 
The quark-to-gluon jet fraction is not directly observable, but it is 
an extremely useful theoretical concept. The observables are the jet properties in a given 
sample, which correlate with the quark or gluon jet fraction. These properties, such as mass 
of the hardest jet, can in principle also be calculated. Certain regions of phase space, the 
ones with pure samples of quark or gluon jets discussed in this paper, should allow us to 
test calculations and calibrate simulations of jet properties more efficiently. With the better 
experimental handle on jet properties arising from the study of these samples, we will be better 
prepared to extract properties of fundamental standard-model or beyond-the-standard-model 
physics encoded in hadronic events. 